Cooperative Crawling
Author
Abstract
Web crawler design presents many challenges: architecture, visit strategies, performance, and more. One of the most important research topics concerns improving the selection of web pages that are "interesting" for the user, according to importance metrics. Another relevant point is content freshness, i.e. maintaining the freshness and consistency of temporarily stored copies; for this, the crawler periodically revisits the stored content (the re-crawling process). In this paper, we propose a scheme that lets a crawler acquire information about the global state of a website before the crawling process takes place. The scheme requires web server cooperation: the server collects and publishes information about its content that a crawler can use to tune its visit strategy. If this information is unavailable or out of date, the crawler simply acts in the usual manner. In this sense the proposed scheme is non-invasive and independent of any particular crawling strategy and architecture.
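The idea of a server-published summary that a crawler consults before visiting can be sketched as follows. This is a minimal illustration, not the paper's actual protocol: it assumes the server exposes a manifest mapping URLs to last-modified timestamps, which the crawler compares against its stored copies to decide what to re-crawl.

```python
def plan_recrawl(server_manifest: dict, local_cache: dict) -> list:
    """Return URLs that are new or stale according to the server manifest.

    server_manifest: url -> last-modified timestamp published by the server
    local_cache:     url -> timestamp of the locally stored copy
    """
    to_fetch = []
    for url, remote_ts in server_manifest.items():
        local_ts = local_cache.get(url)
        # Fetch pages we have never seen, or whose stored copy is older
        # than what the server advertises.
        if local_ts is None or local_ts < remote_ts:
            to_fetch.append(url)
    return sorted(to_fetch)

manifest = {"/index.html": 100, "/news.html": 250, "/about.html": 90}
cache = {"/index.html": 100, "/news.html": 200}
print(plan_recrawl(manifest, cache))  # → ['/about.html', '/news.html']
```

If the manifest is missing, the crawler falls back to its usual blind visit strategy, which is what makes the scheme non-invasive.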
Similar resources
Distributed, Interleaved, Parallel and Cooperative Search in Constraint Satisfaction Networks
In this work, we extend the efficiency of distributed search in constraint satisfaction networks. Our method adds interleaving and parallelism to distributed backtrack search. Moreover, it has a filtering capacity that makes it open to cooperative work. Experiments show that 1) the shape of the phase transition with random problems can be characterized, 2) important speed-ups can be achieved w...
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, downloading only domain-specific web pages is not a simple task, and an unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
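Prioritized ordering of the URL queue, as this snippet describes, can be sketched with a max-priority frontier. The class name, scores, and URLs below are illustrative assumptions, not taken from the paper:

```python
import heapq

class FrontierQueue:
    """Priority frontier: pop the URL with the highest relevance score first."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps pops deterministic

    def push(self, url: str, score: float):
        # heapq is a min-heap, so negate the score for max-first ordering.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

q = FrontierQueue()
q.push("http://example.com/sports", 0.2)
q.push("http://example.com/python", 0.9)
q.push("http://example.com/news", 0.5)
print(q.pop())  # → http://example.com/python (most relevant first)
```

A focused crawler would compute the score from the page's topical relevance; here the scores are simply supplied by the caller.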
Information Sharing among Heterogeneous Reusable Agents in Cooperative Distributed Search
Information sharing among heterogeneous reusable agents in cooperative distributed search systems can greatly affect the quality of solutions and the runtime efficiency of the system. In this paper, we first give a formal description of shareable information in systems where agents have private knowledge and databases and where agents are specifically intended to be reusable. We then present e...
Parallel Web Spiders for Cooperative Information Gathering
A web spider is a widely used means of obtaining information for search engines. As the size of the Web grows, parallelizing the spider's crawling process becomes a natural choice. This paper presents a parallel web spider model based on a multi-agent system for cooperative information gathering. It uses a dynamic assignment mechanism to eliminate redundant web pages caused by parallelization....
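The deduplication side of such a parallel spider can be illustrated with a shared URL registry that lets only the first spider claim a page. The class and method names are hypothetical; the paper's actual dynamic assignment mechanism is not detailed in this snippet:

```python
from threading import Lock

class UrlRegistry:
    """Shared registry: claim() succeeds only for the first spider to see a URL."""

    def __init__(self):
        self._seen = set()
        self._lock = Lock()  # serialize access across spider threads

    def claim(self, url: str) -> bool:
        with self._lock:
            if url in self._seen:
                return False  # another spider already owns this page
            self._seen.add(url)
            return True

registry = UrlRegistry()
print(registry.claim("http://example.com/a"))  # → True  (first claim wins)
print(registry.claim("http://example.com/a"))  # → False (redundant fetch avoided)
```

Each spider calls claim() before fetching, so a URL discovered by several spiders in parallel is downloaded only once.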
Subjective partial cooperation in multi-agent local search
A recently proposed partial cooperation model offers a balance between the two extreme scenarios commonly assumed in multi-agent systems: completely competitive or fully cooperative agents. Partially cooperative agents act cooperatively in a distributed search process as long as the outcome satisfies some threshold on their personal utility; otherwise, they act selfishly. While p...
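The threshold rule described in this snippet amounts to a one-line decision function. The function name and utility values below are illustrative assumptions:

```python
def choose_action(cooperative_utility: float, threshold: float) -> str:
    """Partially cooperative agent: join the distributed search while the
    cooperative outcome keeps personal utility at or above the threshold,
    otherwise fall back to selfish behavior."""
    return "cooperate" if cooperative_utility >= threshold else "selfish"

print(choose_action(0.8, 0.5))  # → cooperate (outcome good enough)
print(choose_action(0.3, 0.5))  # → selfish   (below personal threshold)
```

Setting the threshold to 0 recovers a fully cooperative agent, while a very high threshold recovers a purely competitive one.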